Note: I have generated 3D plots for both PCA and MDS. I used the plot_ly package since it allows rotation and gives better visualization. However, at the very end I realized that at most 16 of these plots can be shown in an html file (I could not find the reason), and I have 20 plots in total. Thus, Task 2 is in the “IE582_HW2_PART2.html” file.
The objective in this task is to understand whether any significant information about the game outcomes can be obtained from the odds data of multiple bookmakers. The data used in this task is read below:
data_matches <- readRDS("/Users/elifkonyar/Desktop/IE 582/HW1/df9b1196-e3cf-4cc7-9159-f236fe738215_matches.rds")
data_odds <- readRDS("/Users/elifkonyar/Desktop/IE 582/HW1/df9b1196-e3cf-4cc7-9159-f236fe738215_odd_details.rds")
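The code in this report relies on a few packages, which are assumed to be loaded at the top of the script (data.table for the dcast/melt reshaping and `:=` assignments, dplyr and tidyr for the score parsing, and plotly for the interactive 3D plots):

```r
# packages assumed throughout this report
library(data.table)  # dcast(), melt(), := assignment
library(dplyr)       # left_join(), %>% pipes
library(tidyr)       # separate()
library(plotly)      # plot_ly() interactive 3D scatter plots
```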
In this part, the aim is to explain the outcome of a match, in terms of whether it finished over 2.5 goals or not, using the odds for different types of bets. To achieve this, for each bookmaker and each game, a feature vector is created whose elements correspond to the different odds. Afterwards, Principal Component Analysis (PCA) is performed to obtain a lower-dimensional set of features, since before omitting any features the dimension of the feature vector is 117. Additionally, the observations are plotted along the first 3 eigenvectors, color-coded by match outcome, in order to look for interesting patterns.
bet_types <- data_odds[, .N, keyby = .(betType)]
odd_types <- data_odds[, .N, keyby = .(oddtype)]
# count the number of different odd types a bookmaker has
bookmakers <- data_odds[, .(oddtype = unique(oddtype)), keyby = .(bookmaker)]
bookmakers <- bookmakers[, .N, keyby = .(bookmaker)]
bookmakers <- bookmakers[order(-N)]
# chosen bookmakers: 10Bet, 1xBet, Betsson, 888sport,
# BetVictor
This strategy will be performed for 5 bookmakers. The chosen bookmakers are: 10Bet, 1xBet, Betsson, 888sport and BetVictor.
Firstly, PCA will be performed on the odds data from the bookmaker 10Bet. Since a match can have more than one odd for any odd type, the one with the latest timestamp is chosen; therefore, the focus is on the final odds.
Preprocessing steps are performed on the odds data from 10Bet. Since the oddtype, betType and totalhandicap columns all carry information about the odd type, these columns are merged into a single feature name in the code below.
bm1 <- data_odds[bookmaker == "10Bet"]
# choose the final odds
bm1 <- bm1[, .SD[which.max(date)], keyby = .(matchId, betType,
oddtype, totalhandicap)]
bm1[, features := paste(betType, oddtype, sep = "_")]
bm1_part2 <- bm1[is.na(totalhandicap)]
bm1_part1 <- bm1[!is.na(totalhandicap)]
bm1_part1[, features := paste(features, totalhandicap, sep = "_")]
bm1 <- rbind(bm1_part2, bm1_part1)
bm1 <- bm1[order(matchId)]
Afterwards, the odds data is transformed so that the rows represent the different matches and the columns the different odd types, summing up to 117 features in total. Thus, each match can be treated as an observation for PCA. Since, once the final odds are taken, there is only one odd per feature type per match, the mean in the code below does not perform any actual averaging; it simply shapes the table as desired, with NA’s for missing combinations.
featurespivot <- dcast(bm1, matchId ~ features, fun.aggregate = mean,
value.var = "odd")
The problem at this step is that not all matches have all odd types, so the “featurespivot” table contains NA’s. In order to perform PCA, the data table must have no NA’s: they should either be imputed or the affected observations removed, and the second option is applied here. However, some odd types have very few observations; Asian handicap odds are an example. To be more specific, out of 3084 matches, 10Bet has only 2 matches with the home-win Asian handicap odd at handicap 2.5. If these kinds of odd types were included, very few complete observations would remain. Therefore, columns with more than 1500 NA’s are removed first, and it is ensured that enough features remain to explain the match outcome. The tricky issue here is how to set the maximum number of NA’s allowed per column before taking the complete cases. This threshold is determined by trial and error, so that in the end there are enough match observations and enough features, but not so many observations that the plot becomes too messy to interpret. Afterwards, only the complete cases are used for PCA.
na_count_in_cols <- as.matrix(apply(featurespivot, 2, function(x) sum(is.na(x))))
## fewer than 1500 NA's
few_na <- apply(na_count_in_cols, 1, function(x) x < 1500)
few_na <- few_na[2:length(na_count_in_cols)] #omit match id column
# transposing the features table in order to remove the
# columns with many NA's much more easily
t_featurespivot <- dcast(melt(featurespivot, id.vars = "matchId"),
variable ~ matchId)
t_featurespivot_updated <- t_featurespivot[few_na]
setnames(t_featurespivot_updated, old = "variable", new = "oddtypes")
# transposing the features table back to the original
featurespivot_updated <- dcast(melt(t_featurespivot_updated,
id.vars = "oddtypes"), variable ~ oddtypes)
setnames(featurespivot_updated, old = "variable", new = "matchId")
# omit matches with a missing value for any odd type
complete_rows <- complete.cases(featurespivot_updated)
complete_updated <- featurespivot_updated[complete_rows]
bm1_dist <- complete_updated[, 2:ncol(complete_updated)] #for multidimensional scaling part
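The double melt/dcast transpose above works, but data.table can also drop the high-NA columns directly, without transposing. A minimal sketch on a toy table (the toy column names and the threshold `max_na` are illustrative; in the report the table is featurespivot and the threshold is 1500):

```r
library(data.table)
# toy stand-in for featurespivot: matches in rows, odd types in columns
toy <- data.table(matchId = 1:4,
                  odd_a = c(1.5, NA, 2.0, 1.8),
                  odd_b = c(NA, NA, NA, 2.1))
max_na <- 2
# keep matchId plus every odd column with at most max_na missing values
keep <- c(TRUE, toy[, sapply(.SD, function(x) sum(is.na(x)) <= max_na),
                    .SDcols = setdiff(names(toy), "matchId")])
toy_filtered <- toy[, keep, with = FALSE]
# then take complete cases as before
toy_complete <- toy_filtered[complete.cases(toy_filtered)]
```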
After preparing the table for PCA, with 173 observations and 20 features, an indicator of whether the corresponding match finished over 2.5 or not is computed with the following code. That information is kept in the “match_isover” table.
## score column of the matches
match_scores <- data_matches[, .(matchId, score)]
complete_updated_w_scores <- left_join(complete_updated, match_scores,
by = c(matchId = "matchId"))
complete_updated_w_scores <- complete_updated_w_scores %>% separate(score,
c("home_goals", "away_goals"), ":")
complete_updated_w_scores$home_goals <- as.numeric(complete_updated_w_scores$home_goals)
complete_updated_w_scores$away_goals <- as.numeric(complete_updated_w_scores$away_goals)
complete_updated_w_scores$total_goals <- complete_updated_w_scores$home_goals +
complete_updated_w_scores$away_goals
# adding indicator column whether a match is finished
# 'over'(more than 2.5 goals) or not
complete_updated_w_scores <- as.data.frame(complete_updated_w_scores)
complete_updated_w_scores$is_over <- ifelse(complete_updated_w_scores$total_goals >
    2.5, 1, 0)
complete_updated_w_scores <- as.data.table(complete_updated_w_scores)
complete_updated_w_scores[, `:=`(home_goals = NULL, away_goals = NULL,
total_goals = NULL)]
match_isover <- complete_updated_w_scores[, .(matchId, is_over)]
bm1_is_over <- complete_updated_w_scores$is_over #for multidimensional scaling
As the next step, PCA is performed with the “princomp” function. Scaling is performed internally when “cor” is set to TRUE, so there is no need to scale the data additionally before applying PCA.
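This claim can be checked on random data: PCA on the correlation matrix gives the same variance proportions as PCA on the pre-standardized data (a quick sketch, not part of the original analysis):

```r
set.seed(7)
X <- matrix(rnorm(200), ncol = 4)
p1 <- princomp(X, cor = TRUE)          # scaling handled internally
p2 <- princomp(scale(X), cor = FALSE)  # scaling done beforehand
# the proportions of variance agree (the sdev's differ only by the
# n vs n-1 divisor, which cancels out in the proportions)
all.equal(p1$sdev^2 / sum(p1$sdev^2), p2$sdev^2 / sum(p2$sdev^2))
```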
## pca
pca_bm1 <- princomp(complete_updated[, -1], cor = TRUE, scores = TRUE)
summary(pca_bm1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.3211456 2.2471793 1.6154772 0.63308407 0.49852232
## Proportion of Variance 0.5515004 0.2524907 0.1304883 0.02003977 0.01242623
## Cumulative Proportion 0.5515004 0.8039911 0.9344795 0.95451925 0.96694547
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.385359238 0.365424257 0.303112946 0.287106286
## Proportion of Variance 0.007425087 0.006676744 0.004593873 0.004121501
## Cumulative Proportion 0.974370559 0.981047303 0.985641176 0.989762677
## Comp.10 Comp.11 Comp.12 Comp.13
## Standard deviation 0.234238671 0.216367380 0.160317952 0.1276367374
## Proportion of Variance 0.002743388 0.002340742 0.001285092 0.0008145568
## Cumulative Proportion 0.992506065 0.994846807 0.996131900 0.9969464564
## Comp.14 Comp.15 Comp.16 Comp.17
## Standard deviation 0.1220791014 0.1101500055 0.1022866031 0.0907054647
## Proportion of Variance 0.0007451653 0.0006066512 0.0005231275 0.0004113741
## Cumulative Proportion 0.9976916217 0.9982982729 0.9988214004 0.9992327744
## Comp.18 Comp.19 Comp.20
## Standard deviation 0.0782809515 0.0749367215 0.0600090954
## Proportion of Variance 0.0003063954 0.0002807756 0.0001800546
## Cumulative Proportion 0.9995391698 0.9998199454 1.0000000000
plot(pca_bm1, xlab = "Components", main = "Variance along the Principal Components")
According to the summary above, the first principal component covers 55% of the variance in the data. With 2 principal components, 80% of the variance is covered, and when the third component is added, 93.4% is covered. Thus, in this task it is decided to lower the dimension to 3, so that with just 3 dimensions almost 94% of the variability in the data is preserved.
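This choice can also be automated. A small helper (hypothetical, not part of the original analysis) returns the number of components needed to reach a target cumulative proportion of variance; per the summary above, calling it on pca_bm1 with a 0.93 target would give 3:

```r
# number of components needed to retain a given share of total variance
n_components <- function(pca, threshold = 0.90) {
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cum_var >= threshold)[1]
}
# example on random data (in the report: n_components(pca_bm1, 0.93))
set.seed(1)
n_components(princomp(matrix(rnorm(200), ncol = 5), cor = TRUE))
```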
# new coordinates
new_bm1 <- pca_bm1$scores
new_bm1 <- as.data.table(new_bm1)
# new_bm1_2ev<-new_bm1[,1:2]
# new_bm1_2ev<-cbind(new_bm1_2ev,match_isover)
# plot(new_bm1_2ev$Comp.1,new_bm1_2ev$Comp.2,col=new_bm1_2ev$is_over+3)
new_bm1_3ev <- new_bm1[, 1:3]
new_bm1_3ev_1 <- cbind(new_bm1_3ev, match_isover) #with indicator column
For plotting, the “plot_ly” package is utilized. In the plot, the x axis corresponds to the first principal component, the y axis to the second, and the z axis to the third. Furthermore, the red points are matches that finished under 2.5 and the blue points matches that finished over 2.5, shown in the legend as “0” and “1” respectively.
The advantage of the 3D plots in this package is that they can be rotated in any direction, so that all pairwise combinations of the principal components can be analyzed together, in addition to looking at each principal component separately. The plot can be seen below:
plot_ly(new_bm1_3ev_1, opacity = 0.7, x = ~Comp.1, y = ~Comp.2,
z = ~Comp.3, color = ~as.factor(is_over), colors = c("#BF382A",
"#0C4B8E")) %>% add_markers(size = 2) %>% layout(title = "The Mapping of The Matches",
scene = list(xaxis = list(title = "EigenVector_1"), yaxis = list(title = "EigenVector_2"),
zaxis = list(title = "EigenVector_3")))
It can be seen from the plot that as the scores along eigenvectors 1 and 2 become more negative, the majority of those matches finished under 2.5, as shown with the red points. Also, as the score along eigenvector 3 gets higher, the majority of the matches become “over”. So, the features driving these eigenvectors should now be checked.
(loadings <- loadings(pca_bm1)[, 1:3])
## Comp.1 Comp.2 Comp.3
## 1x2_odd1 0.019174966 0.38147350 -0.307142511
## 1x2_odd2 -0.108307616 -0.39790861 -0.139169975
## 1x2_oddX -0.189392763 -0.12271889 -0.430529957
## ah_1_0 0.009461639 0.37496521 -0.325177377
## ah_2_0 -0.082639280 -0.38501064 -0.043684666
## bts_NO -0.153500769 0.15185126 0.445960035
## bts_YES 0.156761388 -0.17274473 -0.422345851
## dc_12 0.207816258 0.07845759 0.357586216
## dc_1X 0.012139277 0.39482448 -0.267408388
## dc_X2 -0.113878884 -0.39611205 -0.108405338
## ou_over_2.25 0.285760475 -0.06371676 -0.026020245
## ou_over_2.5 0.290251981 -0.05058079 -0.016917072
## ou_over_2.75 0.291691174 -0.04946104 -0.025961029
## ou_over_3 0.288009608 -0.05527672 -0.029063615
## ou_over_3.5 0.290650764 -0.05622816 -0.038844255
## ou_under_2.25 -0.289295955 0.04251393 0.006732435
## ou_under_2.5 -0.293623141 0.02130930 -0.024295925
## ou_under_2.75 -0.293400328 0.03485591 0.015141534
## ou_under_3 -0.282971446 0.03109203 -0.002750260
## ou_under_3.5 -0.293587666 0.02614362 0.006633720
For the scores along eigenvectors 1 and 2 to become more negative, the odd for an away win (1x2_odd2), the odd for a tie (1x2_oddX) or the double-chance odd of tie or away win (dc_X2) should be high. Thus, when these odds are very high, it can be concluded that the match is more likely to finish “under”. Additionally, the score along eigenvector 3 becomes more positive as the odd for both teams not to score (bts_NO) and the double-chance odd of home or away win (dc_12) get higher. So, when these odds are very high, the match is more likely to finish “over”.
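Reading off the dominant features by eye is feasible for 20 loadings, but a small helper (hypothetical, for illustration) makes the same check systematic for the larger loading tables of the bookmakers below; on the report’s PCA it would be called as top_loadings(loadings(pca_bm1), comp = 3):

```r
# list the n features with the largest absolute loading on a component
top_loadings <- function(L, comp = 1, n = 5) {
  v <- L[, comp]
  v[order(abs(v), decreasing = TRUE)][seq_len(n)]
}
# example on a small random loading matrix
set.seed(3)
L <- matrix(rnorm(20), nrow = 10,
            dimnames = list(paste0("feat", 1:10), c("Comp.1", "Comp.2")))
top_loadings(L, comp = 1, n = 3)
```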
The same procedure is applied in order to perform PCA on the odds data from the bookmaker 1xBet. In this case, after transforming the data into the features table, the columns with more than 700 NA’s are removed before omitting incomplete observations. At the end, the features table has 430 observations and 46 variables.
pca_bm1 <- princomp(complete_updated[, -1], cor = TRUE, scores = TRUE)
summary(pca_bm1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 4.7237547 3.9612532 1.54474447 1.10349179
## Proportion of Variance 0.4850839 0.3411202 0.05187468 0.02647161
## Cumulative Proportion 0.4850839 0.8262040 0.87807872 0.90455033
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 1.00027654 0.82160822 0.633031062 0.589848917
## Proportion of Variance 0.02175116 0.01467478 0.008711485 0.007563516
## Cumulative Proportion 0.92630149 0.94097627 0.949687759 0.957251275
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.537406613 0.481646332 0.450090369 0.407317576
## Proportion of Variance 0.006278388 0.005043113 0.004403942 0.003606687
## Cumulative Proportion 0.963529663 0.968572776 0.972976718 0.976583406
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.396595914 0.367312295 0.358776737 0.317274802
## Proportion of Variance 0.003419311 0.002933007 0.002798277 0.002188333
## Cumulative Proportion 0.980002717 0.982935724 0.985734001 0.987922334
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.287143543 0.267345312 0.246546168 0.221707673
## Proportion of Variance 0.001792422 0.001553772 0.001321413 0.001068572
## Cumulative Proportion 0.989714756 0.991268528 0.992589941 0.993658513
## Comp.21 Comp.22 Comp.23 Comp.24
## Standard deviation 0.1814713488 0.1684488672 0.1629680845 0.1549714886
## Proportion of Variance 0.0007159098 0.0006168483 0.0005773608 0.0005220905
## Cumulative Proportion 0.9943744224 0.9949912707 0.9955686315 0.9960907220
## Comp.25 Comp.26 Comp.27 Comp.28
## Standard deviation 0.1526041851 0.1412978851 0.1391014957 0.124468961
## Proportion of Variance 0.0005062617 0.0004340237 0.0004206354 0.000336794
## Cumulative Proportion 0.9965969836 0.9970310074 0.9974516427 0.997788437
## Comp.29 Comp.30 Comp.31 Comp.32
## Standard deviation 0.1173776408 0.1090483803 0.1014926997 0.0958201151
## Proportion of Variance 0.0002995111 0.0002585119 0.0002239297 0.0001995977
## Cumulative Proportion 0.9980879478 0.9983464597 0.9985703895 0.9987699872
## Comp.33 Comp.34 Comp.35 Comp.36
## Standard deviation 0.0906487022 0.0843444600 0.0823198068 0.0790205252
## Proportion of Variance 0.0001786345 0.0001546519 0.0001473163 0.0001357444
## Cumulative Proportion 0.9989486217 0.9991032736 0.9992505899 0.9993863343
## Comp.37 Comp.38 Comp.39 Comp.40
## Standard deviation 0.0732427718 6.465682e-02 6.186305e-02 5.626826e-02
## Proportion of Variance 0.0001166196 9.088054e-05 8.319647e-05 6.882863e-05
## Cumulative Proportion 0.9995029540 9.995938e-01 9.996770e-01 9.997459e-01
## Comp.41 Comp.42 Comp.43 Comp.44
## Standard deviation 5.429812e-02 4.951916e-02 4.665384e-02 4.114830e-02
## Proportion of Variance 6.409317e-05 5.330756e-05 4.731698e-05 3.680832e-05
## Cumulative Proportion 9.998100e-01 9.998633e-01 9.999106e-01 9.999474e-01
## Comp.45 Comp.46
## Standard deviation 3.643294e-02 3.305905e-02
## Proportion of Variance 2.885563e-05 2.375871e-05
## Cumulative Proportion 9.999762e-01 1.000000e+00
plot(pca_bm1, xlab = "Components", main = "Variance along the Principal Components")
According to the PCA summary above, the first principal component covers almost 48.5% of the variance in the data. With 2 principal components, 82.6% of the variance is covered, so the first two principal components in this case cover less variance than in the previous case of 10Bet. When the third component is added, 87.8% of the variance is covered.
The plot of the new coordinates can be seen below:
It can be seen from the plot that the matches mapped to the more negative values along the first eigenvector mostly finished “over”, while those mapped to the more positive values along the same eigenvector mostly finished “under”. Additionally, matches mapped to the positive side of eigenvector 2 mostly finished “over”. One last point is that the most positive values along the third eigenvector correspond to matches that finished “over”. Now, the features driving these eigenvectors should be checked.
## Comp.1 Comp.2 Comp.3
## 1x2_odd1 -0.001361114 0.2392963474 -0.1836800797
## 1x2_odd2 -0.008508270 -0.2436709343 -0.1340710304
## 1x2_oddX -0.153375974 -0.0701037794 -0.3754021775
## ah_1_+1 -0.044802617 0.2257415167 -0.2151141210
## ah_1_+1.5 -0.053042796 0.2303659730 -0.1841646593
## ah_1_+2 -0.063760854 0.1917561278 -0.2404295623
## ah_1_-1 0.015964879 0.2301387332 -0.1884267863
## ah_1_-1.5 0.027983065 0.2361319346 -0.1731631893
## ah_1_-2 0.038421143 0.1957760708 -0.0833746827
## ah_1_-2.5 0.041697888 0.2108321623 -0.1153117316
## ah_1_0 -0.019632897 0.2363887734 -0.2026016472
## ah_2_+1 0.024566542 -0.2311614055 -0.0436524178
## ah_2_+1.5 0.035636908 -0.2339028103 -0.0399148678
## ah_2_+2 0.024525182 -0.1818570925 0.0270007758
## ah_2_-1 -0.059700810 -0.2230126131 -0.2228576887
## ah_2_-1.5 -0.064808234 -0.2240586618 -0.2029446480
## ah_2_-2 -0.072013291 -0.1920379770 -0.3018349792
## ah_2_-2.5 -0.092840941 -0.1770930861 -0.3058966338
## ah_2_0 -0.025576046 -0.2386777014 -0.1750524324
## bts_NO -0.190343635 0.0423148541 0.1644141671
## bts_YES 0.186170415 -0.0514370236 -0.1545463053
## dc_12 0.170416392 0.0478136148 0.2593289290
## dc_1X -0.030981664 0.2384457040 -0.1768050905
## dc_X2 -0.040510351 -0.2394911167 -0.1482548849
## ou_over_0.5 0.153944363 -0.0073370195 -0.0272052611
## ou_over_1 0.186699171 -0.0084258167 -0.0799186778
## ou_over_1.5 0.203238752 -0.0130820797 -0.0764497020
## ou_over_2 0.204676901 -0.0100519317 -0.0544358369
## ou_over_2.5 0.207047981 -0.0112696281 -0.0456429501
## ou_over_3 0.204456979 -0.0150842603 -0.0613690075
## ou_over_3.5 0.204811501 -0.0148098696 -0.0646779639
## ou_over_4 0.196507051 -0.0209775602 -0.0647054425
## ou_over_4.5 0.199282391 -0.0211777531 -0.0660236771
## ou_over_5 0.193870182 -0.0163187940 -0.0835623083
## ou_over_5.5 0.165048867 -0.0062939533 -0.1154154810
## ou_under_0.5 -0.179434055 -0.0008869490 -0.0258380463
## ou_under_1 -0.194095736 0.0003348821 0.0399426705
## ou_under_1.5 -0.202208335 0.0070187955 0.0486555670
## ou_under_2 -0.202644786 -0.0061301434 -0.0156851658
## ou_under_2.5 -0.207251445 -0.0005939484 -0.0092559362
## ou_under_3 -0.204002635 0.0014873794 0.0011624075
## ou_under_3.5 -0.205176983 0.0035870897 0.0088590638
## ou_under_4 -0.200171016 0.0180504430 0.0220308277
## ou_under_4.5 -0.203490908 0.0144520034 0.0138891142
## ou_under_5 -0.183279596 0.0129502628 -0.0152068006
## ou_under_5.5 -0.174807730 0.0006905152 0.0008482545
For the score along eigenvector 1 to become more negative, the odd for a tie (1x2_oddX) and the odd for both teams not to score (bts_NO) should be high. bts_NO also makes the score along the third eigenvector more positive. Additionally, as the home-win odd (1x2_odd1) gets bigger, the score along eigenvector 2 becomes more positive. Thus, when these odds are very high, it can be concluded that the match is more likely to finish “over”. On the other hand, the score along eigenvector 1 becomes more positive as the odd of both teams to score (bts_YES) increases, so when this odd is very high, the match is more likely to finish “under”.
The same procedure is applied in order to perform PCA on the odds data from the bookmaker Betsson. In this case, after transforming the data into the features table, the columns with more than 2500 NA’s are removed before omitting incomplete observations. At the end, the features table has 589 observations and 22 variables.
pca_bm1 <- princomp(complete_updated[, -1], cor = TRUE, scores = TRUE)
summary(pca_bm1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.5857621 2.2662043 1.7373199 0.58792167 0.52317671
## Proportion of Variance 0.5844404 0.2334401 0.1371946 0.01571145 0.01244154
## Cumulative Proportion 0.5844404 0.8178805 0.9550751 0.97078654 0.98322808
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.334039431 0.239059216 0.230598109 0.175587832
## Proportion of Variance 0.005071925 0.002597696 0.002417068 0.001401413
## Cumulative Proportion 0.988300007 0.990897703 0.993314771 0.994716184
## Comp.10 Comp.11 Comp.12 Comp.13
## Standard deviation 0.165967083 0.148925841 0.1207321429 0.1090353049
## Proportion of Variance 0.001252049 0.001008132 0.0006625568 0.0005403954
## Cumulative Proportion 0.995968233 0.996976365 0.9976389216 0.9981793170
## Comp.14 Comp.15 Comp.16 Comp.17
## Standard deviation 0.1015949186 0.0912441092 0.0790113902 0.0643488567
## Proportion of Variance 0.0004691603 0.0003784312 0.0002837636 0.0001882171
## Cumulative Proportion 0.9986484773 0.9990269086 0.9993106722 0.9994988893
## Comp.18 Comp.19 Comp.20 Comp.21
## Standard deviation 0.057658160 0.0512962456 0.0471917891 0.0408687055
## Proportion of Variance 0.000151112 0.0001196048 0.0001012302 0.0000759205
## Cumulative Proportion 0.999650001 0.9997696060 0.9998708362 0.9999467567
## Comp.22
## Standard deviation 3.422502e-02
## Proportion of Variance 5.324327e-05
## Cumulative Proportion 1.000000e+00
plot(pca_bm1, xlab = "Components", main = "Variance along the Principal Components")
According to the PCA summary above, the first principal component covers almost 58.4% of the variance in the data. With 2 principal components, 81.8% of the variance is covered, and when the third component is added, 95.5% is covered. So, the first three principal components in this case cover much more variance than in the previous case of 1xBet.
The plot of the new coordinates can be seen below:
It can be seen from the plot that the matches mapped to the more negative values along eigenvectors 2 and 3 mostly finished “under”. Additionally, matches mapped to the more positive side of eigenvectors 1 and 3 mostly finished “over”. Now, the features driving these eigenvectors should be checked.
## Comp.1 Comp.2 Comp.3
## 1x2_odd1 0.01131900 0.357658016 -0.332533126
## 1x2_odd2 0.10220900 -0.403828760 -0.073737689
## 1x2_oddX 0.20421126 -0.130569983 -0.344653875
## bts_NO 0.11986996 0.205880036 0.430010007
## bts_YES -0.09945187 -0.198910003 -0.459814434
## dc_12 -0.20849544 0.097089732 0.339051243
## dc_1X 0.02003749 0.372179971 -0.303121181
## dc_X2 0.10965604 -0.403228881 -0.042352132
## ha_1 0.02181793 0.352246274 -0.340651199
## ha_2 0.10972310 -0.398600459 -0.081967338
## ou_over_0.5 -0.25797956 -0.036410612 0.042122985
## ou_over_1.5 -0.26964955 -0.043770426 0.003828956
## ou_over_2.5 -0.27336677 -0.045220531 -0.024733366
## ou_over_3.5 -0.27065997 -0.049535960 -0.055283132
## ou_over_4.5 -0.26858508 -0.048547710 -0.068900120
## ou_over_5.5 -0.25648117 -0.043612117 -0.089442956
## ou_under_0.5 0.26332837 0.005425811 -0.074859072
## ou_under_1.5 0.27171662 0.028442787 -0.013692676
## ou_under_2.5 0.27368945 0.034256400 0.010553346
## ou_under_3.5 0.27287854 0.033272586 0.049286978
## ou_under_4.5 0.26950237 0.030234158 0.061335254
## ou_under_5.5 0.25437970 0.021745039 0.084665667
For the scores along eigenvectors 2 and 3 to be more negative, the odd for a tie (1x2_oddX) and the odd for both teams to score (bts_YES) should be high. When these odds are high, the match is more likely to finish “under”. On the other hand, the scores along eigenvectors 1 and 3 become more positive as the odd for both teams not to score (bts_NO) gets high; in such a situation, the match is more likely to finish “over”.
The same procedure is applied in order to perform PCA on the odds data from the bookmaker 888sport. In this case, after transforming the data into the features table, the columns with more than 2500 NA’s are removed before omitting incomplete observations. At the end, the features table has 328 observations and 24 variables.
pca_bm1 <- princomp(complete_updated[, -1], cor = TRUE, scores = TRUE)
summary(pca_bm1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.9260558 2.1369776 1.7010492 0.69846999 0.53156817
## Proportion of Variance 0.6422464 0.1902781 0.1205654 0.02032751 0.01177353
## Cumulative Proportion 0.6422464 0.8325245 0.9530898 0.97341735 0.98519088
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.308292434 0.249387787 0.242667092 0.200552798
## Proportion of Variance 0.003960176 0.002591428 0.002453638 0.001675893
## Cumulative Proportion 0.989151056 0.991742484 0.994196122 0.995872015
## Comp.10 Comp.11 Comp.12 Comp.13
## Standard deviation 0.1437633456 0.1338889194 0.1251477548 0.1146722008
## Proportion of Variance 0.0008611625 0.0007469268 0.0006525817 0.0005479047
## Cumulative Proportion 0.9967331774 0.9974801041 0.9981326858 0.9986805906
## Comp.14 Comp.15 Comp.16 Comp.17
## Standard deviation 0.0904906548 0.0833550606 0.0694351843 0.0528278653
## Proportion of Variance 0.0003411899 0.0002895028 0.0002008852 0.0001162826
## Cumulative Proportion 0.9990217805 0.9993112833 0.9995121685 0.9996284511
## Comp.18 Comp.19 Comp.20 Comp.21
## Standard deviation 4.723168e-02 4.262229e-02 0.0397120238 3.681217e-02
## Proportion of Variance 9.295134e-05 7.569413e-05 0.0000657102 5.646401e-05
## Cumulative Proportion 9.997214e-01 9.997971e-01 0.9998628068 9.999193e-01
## Comp.22 Comp.23 Comp.24
## Standard deviation 2.951224e-02 2.611727e-02 1.960656e-02
## Proportion of Variance 3.629051e-05 2.842133e-05 1.601738e-05
## Cumulative Proportion 9.999556e-01 9.999840e-01 1.000000e+00
plot(pca_bm1, xlab = "Components", main = "Variance along the Principal Components")
According to the PCA summary above, the first principal component covers almost 64.2% of the variance in the data. With 2 principal components, 83.2% of the variance is covered, and when the third component is added, 95.3% is covered. So, the first three principal components in this case cover almost the same variance as in the previous case.
The plot of the new coordinates can be seen below:
It can be seen from the plot that the matches mapped to the more negative values along eigenvectors 1 and 2, and at the same time to the more positive values along eigenvector 3, mostly finished “over”. Similarly, matches mapped to the more positive side of eigenvector 2 and the more negative side of eigenvector 3 finished “over” most of the time. Now, the features driving these eigenvectors should be checked.
## Comp.1 Comp.2 Comp.3
## 1x2_odd1 0.02550831 0.37411803 0.34445083
## 1x2_odd2 -0.16671534 -0.34839670 0.04664796
## 1x2_oddX -0.21630701 -0.11165998 0.26282681
## bts_NO 0.05971545 0.23070090 -0.48034092
## bts_YES -0.07844963 -0.24633871 0.45402974
## dc_12 0.21165664 0.07815341 -0.28358477
## dc_1X 0.02649210 0.38421580 0.32649317
## dc_X2 -0.17437476 -0.33842782 0.02257083
## ha_1 0.01082465 0.36771921 0.35787807
## ha_2 -0.17203169 -0.33724299 0.06283369
## ou_over_0.5 0.22734745 -0.05908301 -0.02337790
## ou_over_1.5 0.24209351 -0.08410300 0.05611729
## ou_over_2.5 0.24481421 -0.09282019 0.05561027
## ou_over_3.5 0.24335499 -0.09398048 0.07241751
## ou_over_4.5 0.24120613 -0.09694755 0.06860790
## ou_over_5.5 0.23830814 -0.09706741 0.07735791
## ou_over_6.5 0.23832176 -0.09086958 0.06751342
## ou_under_0.5 -0.23855335 0.03913074 0.01332500
## ou_under_1.5 -0.24497760 0.06702849 -0.05668855
## ou_under_2.5 -0.24748806 0.07998287 -0.05325581
## ou_under_3.5 -0.24583972 0.08363030 -0.07041717
## ou_under_4.5 -0.24421431 0.08709352 -0.06685929
## ou_under_5.5 -0.24099330 0.08835179 -0.08012155
## ou_under_6.5 -0.22777067 0.07191366 -0.06572334
For the scores along eigenvectors 1 and 2 to be more negative and the score along eigenvector 3 more positive at the same time, the odd for a tie in the match result (1x2_oddX) should be high. Furthermore, the score along eigenvector 2 becomes more positive, while the score along eigenvector 3 becomes more negative, when the odd for both teams not to score (bts_NO) gets high. In such situations, the match is more likely to finish “over”.
The same procedure is applied in order to perform PCA on the odds data from the bookmaker BetVictor. In this case, after transforming the data into the features table, the columns with more than 2500 NA’s are removed before omitting incomplete observations. At the end, the features table has 405 observations and 26 variables.
pca_bm1 <- princomp(complete_updated[, -1], cor = TRUE, scores = TRUE)
summary(pca_bm1)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 3.6877739 3.0539578 0.92876520 0.76016017
## Proportion of Variance 0.5230645 0.3587176 0.03317711 0.02222475
## Cumulative Proportion 0.5230645 0.8817821 0.91495920 0.93718395
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.59839712 0.54904918 0.44016236 0.377073147
## Proportion of Variance 0.01377227 0.01159442 0.00745165 0.005468621
## Cumulative Proportion 0.95095622 0.96255065 0.97000230 0.975470917
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.34888773 0.311485182 0.281192432 0.267351515
## Proportion of Variance 0.00468164 0.003731655 0.003041122 0.002749109
## Cumulative Proportion 0.98015256 0.983884212 0.986925335 0.989674444
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.226101348 0.196021873 0.185064695 0.166989454
## Proportion of Variance 0.001966224 0.001477868 0.001317267 0.001072518
## Cumulative Proportion 0.991640667 0.993118536 0.994435803 0.995508321
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.1522792906 0.1392100900 0.1308646808 0.1158790485
## Proportion of Variance 0.0008918839 0.0007453634 0.0006586756 0.0005164598
## Cumulative Proportion 0.9964002049 0.9971455683 0.9978042439 0.9983207036
## Comp.21 Comp.22 Comp.23 Comp.24
## Standard deviation 0.1077244195 0.092255764 0.0873052159 0.0778248413
## Proportion of Variance 0.0004463289 0.000327351 0.0002931616 0.0002329502
## Cumulative Proportion 0.9987670325 0.999094383 0.9993875451 0.9996204953
## Comp.25 Comp.26
## Standard deviation 0.0742766475 0.0659553033
## Proportion of Variance 0.0002121931 0.0001673116
## Cumulative Proportion 0.9998326884 1.0000000000
plot(pca_bm1, xlab = "Components", main = "Variance along the Principal Components")
According to the PCA summary above, the first principal component covers almost 52.3% of the variance in the data. With 2 principal components, 88.1% of the variance is covered. When the third component is added, 91.5% of the variance is covered.
The plot of the new coordinates can be seen below:
It can be seen from the plot that the matches mapped to the more positive values along eigenvectors 1 and 2, and at the same time to the more negative values along eigenvector 3, mostly finished “under”. On the other hand, matches mapped to the more negative side of eigenvectors 1 and 2 finished “over” most of the time. Now, the features driving these eigenvectors should be checked.
## Comp.1 Comp.2 Comp.3
## 1x2_odd1 0.1642696 -0.24961090 -0.044123619
## 1x2_odd2 -0.1419947 0.26748116 0.022072822
## 1x2_oddX -0.2287865 0.10560571 0.010323287
## ah_1_-0.5 0.1646731 -0.24924990 0.062097018
## ah_1_0/-0.5 0.1391918 -0.25864301 0.017958790
## ah_2_-0.5 -0.1742849 0.23678995 0.081597926
## ah_2_0/-0.5 -0.1483344 0.25607336 0.120331667
## bts_NO -0.1586216 -0.24877171 0.070823136
## bts_YES 0.1595828 0.24493304 -0.119565515
## dc_12 0.2300820 -0.07267392 0.202535569
## dc_1X 0.1274862 -0.27864261 -0.046019951
## dc_X2 -0.1760688 0.24042748 -0.019831511
## ha_1 0.1392810 -0.26860704 -0.019442312
## ha_2 -0.1560846 0.25346378 0.029722345
## ou_over_0.5 0.2152204 0.11643699 0.043739632
## ou_over_1.5 0.2368555 0.13302375 0.028680818
## ou_over_2.5 0.2380662 0.14195548 -0.008260955
## ou_over_3.5 0.2347880 0.14075575 -0.057275056
## ou_over_4.5 0.2292862 0.13621529 -0.187923675
## ou_over_5.5 0.1866135 0.12059959 -0.619099966
## ou_under_0.5 -0.1865696 -0.08904228 -0.642748031
## ou_under_1.5 -0.2376810 -0.13514439 -0.184358975
## ou_under_2.5 -0.2395613 -0.14122896 -0.099146455
## ou_under_3.5 -0.2369421 -0.13875527 -0.123403298
## ou_under_4.5 -0.2271248 -0.13256444 0.067945581
## ou_under_5.5 -0.2214153 -0.13434782 0.081025001
For the mapped values along eigenvectors 1 and 2 to be more positive while the value along eigenvector 3 is more negative, the odds for both teams to score (bts_YES), over 4.5 goals (ou_over_4.5) and over 5.5 goals (ou_over_5.5) should be high. In that case, the match is more likely to finish “under”. Conversely, the mapped values along eigenvectors 1 and 2 become more negative when the odds for both teams not to score (bts_NO) and for the match finishing under the various goal thresholds are high. In such situations, the match is more likely to finish “over”.
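Rather than scanning the loading table by eye, the most influential odds per component can be ranked by the magnitude of their loadings; a small sketch on synthetic data (the column names are illustrative):

```r
set.seed(2)
X <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6)
colnames(X) <- paste0("odd_", 1:6)
pca_fit <- princomp(X, cor = TRUE)

# rank features by the magnitude of their loading on the first component
loads <- pca_fit$loadings[, 1]
top3 <- sort(abs(loads), decreasing = TRUE)[1:3]
names(top3)  # the three odds driving Comp.1 the most
```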
A strategy similar to the PCA part above is followed to apply multidimensional scaling (MDS), although the two methods differ in some respects. Before forming the distance matrix, the data is standardized with the scale function by setting both the center and scale parameters to TRUE. Afterwards, the Euclidean distance is calculated between every pair of rows (matches). As the next step, MDS is applied and 3 new dimensions are obtained; since 3 eigenvectors were considered in the PCA part, 3 dimensions are used in the MDS part as well. With the same plotting strategy, the new coordinates can be seen below:
scaled_bm1_dist <- scale(bm1_dist, center = TRUE, scale = TRUE)
bm1_dist_matrix_euc <- as.matrix(dist(scaled_bm1_dist, method = "euclidean"))
bm1_dist_matrix_euc <- as.data.table(bm1_dist_matrix_euc)
bm1_mds_euc = cmdscale(bm1_dist_matrix_euc, k = 3)
bm1_mds_euc <- cbind(bm1_mds_euc, bm1_is_over)
plot_ly(as.data.frame(bm1_mds_euc), opacity = 0.7, x = ~V1, y = ~V2,
z = ~V3, color = ~as.factor(bm1_is_over), colors = c("#BF382A",
"#0C4B8E")) %>% add_markers(size = 2) %>% layout(title = "The Mapping of The Matches (Euclidean)",
scene = list(xaxis = list(title = "Dimension 1"), yaxis = list(title = "Dimension 2"),
zaxis = list(title = "Dimension 3")))
Unlike PCA, the dimensions in MDS do not have an intrinsic meaning. In PCA, interpretations can be extracted from the eigenvalues and eigenvectors, as was done in the previous part. With MDS, however, one can only look for clustering among the mapped points along the 3 dimensions, so that similar observations are captured. When the plot is inspected, there is a group of matches that finished under on the negative side of dimensions 1, 2 and 3. Also, close to zero in each dimension, there is a cluster of matches that finished over.
As the next step, Manhattan distances are calculated before applying MDS with the same strategy. The 3D plot of the mapped points can be seen below:
bm1_dist_matrix_man <- as.matrix(dist(scaled_bm1_dist, method = "manhattan"))
bm1_dist_matrix_man <- as.data.table(bm1_dist_matrix_man)
bm1_mds_man = cmdscale(bm1_dist_matrix_man, k = 3)
bm1_mds_man <- cbind(bm1_mds_man, bm1_is_over)
plot_ly(as.data.frame(bm1_mds_man), opacity = 0.7, x = ~V1, y = ~V2,
z = ~V3, color = ~as.factor(bm1_is_over), colors = c("#BF382A",
"#0C4B8E")) %>% add_markers(size = 2) %>% layout(title = "The Mapping of The Matches (Manhattan)",
scene = list(xaxis = list(title = "Dimension 1"), yaxis = list(title = "Dimension 2"),
zaxis = list(title = "Dimension 3")))
As can be seen from the plot, the mapped points differ from the Euclidean distance case. Here, the clusters are less obvious. However, again close to zero in each dimension, there is a group of matches that finished over.
A similar strategy is followed to apply MDS to the odds of 1xBet.
The figure above shows that the matches that finished over are clustered close to zero in each dimension. However, the matches that finished under seem to be scattered in the dimension 1-dimension 2 plane.
The mapped points under the Manhattan distance again differ from those under the Euclidean distance. The cluster of matches that finished over is less obvious in this case, although those matches are still loosely grouped close to zero in each dimension. The under matches are scattered.
A similar strategy is followed to apply MDS to the odds of Betsson.
The figure above shows that the matches that finished over are clustered approximately between 0 and 5 in dimensions 1 and 2 and between -2 and 2 in dimension 3. However, there is no obvious cluster for the matches that finished under; they are scattered.
The mapped points under the Manhattan distance again differ from those under the Euclidean distance, and the variation along the dimensions is much larger in this case. Both the matches that finished over and those that finished under are scattered; there is no obvious cluster.
A similar strategy is followed to apply MDS to the odds of 888sport.
The figure above shows that the matches that finished over are clustered at the intersection of the positive side of dimension 1, the negative side of dimension 2 and the central part of dimension 3. The matches that finished under are again scattered.
The mapped points under the Manhattan distance again differ from those under the Euclidean distance, and the variation along the dimensions is much larger here, too. However, at the intersection of the positive sides of dimensions 1 and 2, the matches mostly finished over. For the matches that finished under, there is again no obvious cluster.
A similar strategy is followed to apply MDS to the odds of BetVictor.
In the figure above, both the matches that finished over and those that finished under seem very scattered. However, there are somewhat more over matches on the positive side of dimension 3 and somewhat more under matches on the negative side, although the difference is very slight.
The mapped points under the Manhattan distance again differ from those under the Euclidean distance. In this case, the points are a bit more clustered than in the Euclidean case: the majority of the matches where the dimension 1 value is close to zero finished over. Apart from that, there is no obvious cluster.
For each bookmaker, the MDS results with Euclidean distance are very similar to the PCA results in terms of the mapped points and their distribution along the axes. However, there is no such similarity between the Manhattan distance case and PCA, nor between the Manhattan and Euclidean distance cases.
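The similarity between the Euclidean MDS maps and the PCA maps is expected: classical MDS with Euclidean distances on standardized data recovers the PCA scores up to sign flips. A small sketch on synthetic data illustrates the equivalence:

```r
set.seed(3)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)
Xs <- scale(X, center = TRUE, scale = TRUE)

pca_scores <- prcomp(Xs)$x[, 1:3]                              # PCA scores
mds_coords <- cmdscale(dist(Xs, method = "euclidean"), k = 3)  # classical MDS

# each MDS dimension should correlate (up to sign) with the matching PC
agreement <- sapply(1:3, function(i) abs(cor(mds_coords[, i], pca_scores[, i])))
```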
The objective of this task is to compress images using PCA. Throughout this task, a cat image is used.
In this question, the cat image is read as a variable:
library(jpeg)
cat <- readJPEG("/Users/elifkonyar/Desktop/IMG_2187.jpg")
The structure of the image is a large numeric array with dimensions 512x512x3.
str(cat)
## num [1:512, 1:512, 1:3] 0.941 0.941 0.941 0.941 0.945 ...
The image is displayed below:
# display image
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Cats of Boğaziçi")
rasterImage(cat, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
In order to display each channel of the image, the 3D array is split into 3 components: “r_cat” is the red channel with dimension 512x512, “g_cat” is the green channel, and “b_cat” is the blue channel with the same dimension.
# red channel
r_cat <- cat
r_cat[, , 2] = 0
r_cat[, , 3] = 0
# green channel
g_cat <- cat
g_cat[, , 1] <- 0
g_cat[, , 3] <- 0
# blue channel
b_cat <- cat
b_cat[, , 1] <- 0
b_cat[, , 2] <- 0
Each channel of the image is displayed below:
par(mfrow = c(1, 3))
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Red Channel")
rasterImage(r_cat, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Green Channel")
rasterImage(g_cat, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Blue Channel")
rasterImage(b_cat, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
In this part, random noise from a uniform distribution between 0 and 0.1 is added to each pixel value in each channel of the original image. The add_unif() function below adds the random noise, and the greater_than_1() function caps values greater than 1 at 1, since each pixel value should stay between 0 and 1.
library(abind)
add_unif <- function(x) {
x <- x + runif(1, 0, 0.1)
return(x)
}
greater_than_1 <- function(x) {
if (x > 1) {
return(1)
} else {
return(x)
}
}
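These per-element helpers can also be combined into one vectorized step; a sketch (not the code used for the figures, and the function name add_noise_channel is illustrative) using pmin() to do the capping:

```r
# Vectorized sketch of the same noise step: one uniform draw per pixel,
# capped at 1 with pmin(), avoiding the element-wise apply() passes.
add_noise_channel <- function(channel, max_noise = 0.1) {
    noisy <- channel + runif(length(channel), 0, max_noise)
    pmin(noisy, 1)  # pixel values must stay in [0, 1]
}

set.seed(4)
demo_channel <- matrix(runif(16, 0, 1), nrow = 4)  # stand-in for a 512x512 channel
noisy_demo <- add_noise_channel(demo_channel)
```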
These functions are applied to each channel of the cat image. Then, the channels are bound together again to form an RGB-coloured image.
r_cat <- cat[, , 1]
r_cat_noisy <- apply(r_cat, 1:2, add_unif)
r_cat_noisy <- apply(r_cat_noisy, c(1, 2), greater_than_1)
g_cat <- cat[, , 2]
g_cat_noisy <- apply(g_cat, 1:2, add_unif)
g_cat_noisy <- apply(g_cat_noisy, c(1, 2), greater_than_1)
b_cat <- cat[, , 3]
b_cat_noisy <- apply(b_cat, 1:2, add_unif)
b_cat_noisy <- apply(b_cat_noisy, c(1, 2), greater_than_1)
cat_noisy <- abind(r_cat_noisy, g_cat_noisy, along = 3)
cat_noisy <- abind(cat_noisy, b_cat_noisy, along = 3)
The noisy image is displayed below:
# display the noisy image
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Cats of Boğaziçi")
rasterImage(cat_noisy, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
Each channel of the image is displayed below:
# red channel
r_cat_noisy <- cat_noisy
r_cat_noisy[, , 2] = 0
r_cat_noisy[, , 3] = 0
# green channel
g_cat_noisy <- cat_noisy
g_cat_noisy[, , 1] <- 0
g_cat_noisy[, , 3] <- 0
# blue channel
b_cat_noisy <- cat_noisy
b_cat_noisy[, , 1] <- 0
b_cat_noisy[, , 2] <- 0
# display each channel
par(mfrow = c(1, 3))
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Red Channel")
rasterImage(r_cat_noisy, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Green Channel")
rasterImage(g_cat_noisy, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Blue Channel")
rasterImage(b_cat_noisy, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
The cat image is converted to grayscale using an image editor. Afterwards, the transformed image is read in again, and the same noise-adding procedure is applied to the grayscale version.
cat_grayscale <- readJPEG("/Users/elifkonyar/Desktop/IMG_2187_2.jpg")
# create noisy grayscale image:
cat_grayscale_noisy <- apply(cat_grayscale, 1:2, add_unif)
cat_grayscale_noisy <- apply(cat_grayscale_noisy, c(1, 2), greater_than_1)
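The grayscale conversion done in the image editor could equally be done in R; a sketch using the standard luminosity weights (an assumption here, not necessarily the editor's exact formula):

```r
# Weighted sum of the RGB channels; the weights sum to 1, so the
# result stays in [0, 1].
to_grayscale <- function(rgb_img) {
    0.299 * rgb_img[, , 1] + 0.587 * rgb_img[, , 2] + 0.114 * rgb_img[, , 3]
}

demo_rgb <- array(runif(4 * 4 * 3), dim = c(4, 4, 3))  # stand-in for the cat image
gray_demo <- to_grayscale(demo_rgb)
```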
The noisy grayscale cat image is displayed below:
# display image
plot(c(1, 512), c(1, 512), type = "n", xlab = "", ylab = "",
main = "Cats of Boğaziçi")
rasterImage(cat_grayscale_noisy, 1, 1, 512, 512, angle = 0, interpolate = FALSE)
In the next step, a data matrix is created from the grayscale noisy cat image such that each row corresponds to one 3x3 patch, flattened into 9 columns. The image contains 512x512 pixels, so excluding the one-pixel border there are 510x510=260100 patches of size 9. The data matrix is formed below:
patches <- matrix(0, nrow = 260100, ncol = 9)
k <- 1
# there will be 510*510=260100 patches
for (i in 2:511) {
for (j in 2:511) {
patches[k, 1] <- cat_grayscale_noisy[i - 1, j - 1]
patches[k, 2] <- cat_grayscale_noisy[i - 1, j]
patches[k, 3] <- cat_grayscale_noisy[i - 1, j + 1]
patches[k, 4] <- cat_grayscale_noisy[i, j - 1]
patches[k, 5] <- cat_grayscale_noisy[i, j]
patches[k, 6] <- cat_grayscale_noisy[i, j + 1]
patches[k, 7] <- cat_grayscale_noisy[i + 1, j - 1]
patches[k, 8] <- cat_grayscale_noisy[i + 1, j]
patches[k, 9] <- cat_grayscale_noisy[i + 1, j + 1]
k <- k + 1
}
}
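The double loop above can also be replaced by nine shifted copies of the image, one per patch position; a sketch on a small matrix (the helper name make_patches is illustrative):

```r
make_patches <- function(img) {
    rows <- 2:(nrow(img) - 1)
    cols <- 2:(ncol(img) - 1)
    out <- matrix(0, nrow = length(rows) * length(cols), ncol = 9)
    col_idx <- 1
    for (di in -1:1) {
        for (dj in -1:1) {
            # all pixels at offset (di, dj) from every patch center at once
            block <- img[rows + di, cols + dj]
            out[, col_idx] <- as.vector(t(block))  # row-major, matching the loop order above
            col_idx <- col_idx + 1
        }
    }
    out
}

demo_img <- matrix(1:16, nrow = 4)       # stand-in for the 512x512 image
demo_patches <- make_patches(demo_img)   # 2x2 = 4 patches of 9 pixels each
```

This runs 9 iterations instead of 260100, which matters at the 512x512 scale.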
In order to perform PCA, the princomp() function is used as in the previous questions. With cor = TRUE, the scaling is again done automatically.
pca_image <- princomp(patches, cor = TRUE, scores = TRUE)
summary(pca_image)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 2.9302252 0.3484975 0.30309826 0.220160110
## Proportion of Variance 0.9540244 0.0134945 0.01020762 0.005385608
## Cumulative Proportion 0.9540244 0.9675189 0.97772656 0.983112165
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.203392406 0.18394256 0.173252448 0.161393975
## Proportion of Variance 0.004596497 0.00375943 0.003335157 0.002894224
## Cumulative Proportion 0.987708662 0.99146809 0.994803248 0.997697472
## Comp.9
## Standard deviation 0.143953988
## Proportion of Variance 0.002302528
## Cumulative Proportion 1.000000000
plot(pca_image)
It can be seen from the summary that the first principal component covers 95.4% of the variance. When the second principal component is added, 96.7% of the total variance is covered, and 97.8% is covered when the third principal component is added.
(eigenvectors <- pca_image$loadings)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## [1,] -0.332 0.406 -0.347 -0.533 -0.187 -0.440 -0.147 -0.256
## [2,] -0.334 0.448 -0.118 -0.142 0.465 0.217 0.502 0.370
## [3,] -0.332 0.361 0.402 0.395 -0.427 -0.105 0.159 -0.426 -0.205
## [4,] -0.334 -0.460 0.133 0.320 -0.382 0.484 -0.222 0.353
## [5,] -0.336 0.231 0.621 0.411 -0.528
## [6,] -0.334 0.460 0.144 0.316 -0.374 -0.496 0.206 0.354
## [7,] -0.332 -0.362 -0.408 0.395 -0.426 -0.112 -0.154 0.425 -0.197
## [8,] -0.334 -0.450 -0.118 -0.139 0.480 -0.210 -0.500 0.356
## [9,] -0.332 -0.407 0.354 -0.533 -0.201 0.432 0.150 -0.248
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111
## Cumulative Var 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889
## Comp.9
## SS loadings 1.000
## Proportion Var 0.111
## Cumulative Var 1.000
In order to reconstruct the image, the scores on the first three principal components are obtained:
mapped_points <- pca_image$scores
mapped_points <- as.data.table(mapped_points)
first_ev <- mapped_points$Comp.1
second_ev <- mapped_points$Comp.2
third_ev <- mapped_points$Comp.3
Here, the scores range from -6.146 to 5.171 along the first principal component, from -4.12 to 2.9 along the second, and from -4.475 to 4.015 along the third. The image() function maps these values onto the grayscale palette supplied via the col parameter.
first_ev_mat <- matrix(first_ev, nrow = 510, ncol = 510, byrow = TRUE)
image(1:510, 1:510, t(first_ev_mat[ncol(first_ev_mat):1, ]),
col = gray((0:255)/255), main = "First Principal Component")
second_ev_mat <- matrix(second_ev, nrow = 510, ncol = 510, byrow = TRUE)
image(1:510, 1:510, t(second_ev_mat[ncol(second_ev_mat):1, ]),
col = gray((0:255)/255), main = "Second Principal Component")
third_ev_mat <- matrix(third_ev, nrow = 510, ncol = 510, byrow = TRUE)
image(1:510, 1:510, t(third_ev_mat[ncol(third_ev_mat):1, ]),
col = gray((0:255)/255), main = "Third Principal Component")
As expected, the image from the first principal component is the closest to the original image. As the variance covered by the components decreases, the display deteriorates.
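Beyond plotting the individual score images, the patch matrix itself can be approximated from the first k components. The sketch below uses a small synthetic patch matrix and covariance PCA for simplicity (the analysis above uses cor = TRUE, which would additionally require rescaling by the column standard deviations):

```r
set.seed(6)
patches_demo <- matrix(runif(100 * 9), nrow = 100, ncol = 9)
pca_fit <- princomp(patches_demo, scores = TRUE)

k <- 3
# rank-k approximation: scores on the first k components times their
# loadings, plus the column means that princomp() subtracted
recon <- pca_fit$scores[, 1:k] %*% t(pca_fit$loadings[, 1:k])
recon <- sweep(recon, 2, pca_fit$center, "+")
```

With k = 9 the reconstruction is exact; dropping components trades accuracy for compression, which is the idea behind using PCA on the patches.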
As the next step, the first three eigenvectors are plotted as 3x3 images:
eigenvectors_mat <- loadings(pca_image)[]
eigenvectors_mat <- as.data.frame(eigenvectors_mat)
first <- eigenvectors_mat[, 1]
first <- matrix(first, nrow = 3, ncol = 3, byrow = TRUE)
image(1:3, 1:3, t(first[ncol(first):1, ]), col = gray((0:255)/255),
main = "First Component")
In this image, it can be seen that the first eigenvector captures the corners: the corresponding elements have the highest values in the first eigenvector, so the corner pixels influence it the most in the positive direction. On the other hand, the element corresponding to the center has the most negative value, so it influences the eigenvector the most in the negative direction.
second <- eigenvectors_mat[, 2]
second <- matrix(second, nrow = 3, ncol = 3, byrow = TRUE)
image(1:3, 1:3, t(second[ncol(second):1, ]), col = gray((0:255)/255),
main = "Second Component")
This eigenvector captures the upper and lower sides: the elements of the second eigenvector corresponding to the upper row are the most positive, whereas those corresponding to the lower row are the most negative.
third <- eigenvectors_mat[, 3]
third <- matrix(third, nrow = 3, ncol = 3, byrow = TRUE)
image(1:3, 1:3, t(third[ncol(third):1, ]), col = gray((0:255)/255),
main = "Third Component")
This eigenvector captures the left and right sides: the elements of the third eigenvector corresponding to the right column are the most positive, whereas those corresponding to the left column are the most negative.